friends-tv-series-font


Alberto Munguia (am5334)

Bernardo Lopez (bl2786)

Ivan Ugalde (du2160)

The code for this analysis is published in a public GitHub repository.

I. Introduction

i) Friends, an iconic sitcom

Friends is an American situation comedy, created by David Crane and Marta Kauffman, which aired on NBC from September 22, 1994, to May 6, 2004, lasting ten seasons. It featured an ensemble cast starring Jennifer Aniston (Rachel), Courteney Cox (Monica), Lisa Kudrow (Phoebe), Matt LeBlanc (Joey), Matthew Perry (Chandler) and David Schwimmer (Ross).

The show revolved around six friends in their 20s and 30s who lived in Manhattan, New York City. Rachel Green, a sheltered but friendly woman, flees her wedding day and her rich yet unfulfilling life, and finds childhood friend Monica Geller, a tightly-wound but caring chef. After Rachel becomes a waitress at coffee house Central Perk, she and Monica become roommates at Monica’s apartment located directly above Central Perk, and Rachel joins Monica’s group of single people in their mid-20s: her previous roommate Phoebe Buffay, an eccentric, innocent masseuse; her neighbor across the hall Joey Tribbiani, a dim-witted yet loyal struggling actor and womanizer; Joey’s roommate Chandler Bing, a sarcastic, self-deprecating IT manager; and her older brother and Chandler’s college roommate Ross Geller, a sweet-natured but insecure paleontologist.

Friends received positive reviews throughout its run and became one of the most popular sitcoms of its time. The series won many awards and was nominated for 63 Primetime Emmy Awards. The series was also very successful in the ratings, consistently ranking in the top ten in the final prime-time ratings. Friends has had a large cultural impact and has become a model for later sitcoms.

ii) Motivation & Questions

As teenagers at the beginning of the century, we were heavily influenced by the Friends phenomenon and became huge fans of the sitcom. We decided to work on this project to challenge our preconceptions of the show through data analysis and to discover hidden insights. The questions that guide our quantitative assessment are the following:

  • Can we categorize by importance all the characters who appear in the sitcom? At first glance this question may seem simple, but if we assume no previous knowledge of the sitcom and consider that more than 800 characters appeared over the ten seasons, the analysis becomes a challenge.

  • Can we identify and quantify the interactions between the main and secondary characters? What would be an appropriate way to quantify and visualize these relationships?

  • Which are the most recurrent topics across the seasons and episodes of the show? How did the themes of the show evolve over its ten seasons? Can we extract this information from the dialogues of the show?

  • Can we determine the contribution of each character to the popularity of the sitcom? Does the participation of each character influence the viewer’s preferences?

iii) R Libraries, Machine Learning techniques & Other resources

We used the following R libraries for the development of this project:

II. Data sources

i) Primary sources

The primary data sources that we used for our project, and that we consider to be of adequate quality, are:

  • Transcripts: For the transcripts, we used an open resource built by fans of the sitcom and compiled in a GitHub repository. The repository contains all the dialogues of the characters for the 231 episodes of the TV show. The data is organized in HTML documents and can be accessed via: https://fangj.github.io/friends/.

  • Ratings: For the ratings, we used the IMDb Datasets, which are available to customers for personal and non-commercial use. The data is structured in seven compressed CSV files that contain general information about the show (genre, start year, end year, episode duration, etc.) and specific information about each episode (title, rating, characters, crew, etc.). A relevant characteristic of the database is that it is refreshed daily. We retrieved the data on November 10, 2019. The datasets can be accessed via: https://datasets.imdbws.com/

ii) Data quality and challenges

  • IMDb Dataset:

    • The first obstacle that we faced with the IMDb datasets was the size of the datasets, some with millions of rows with information of TV shows, shorts, movies, documentaries, and other entertainment formats. Due to their size, it was not possible to store them in GitHub.

    • The second obstacle was to identify the data corresponding to our case study. For example, when we searched the dataset only by the name ‘Friends’, we found 178 TV shows and movies called ‘Friends’. We had to research the start and end years of the series to refine the search.

    • Another obstacle was that the ID for TV series is not named uniformly across the seven IMDb datasets. For example, in the dataset with the titles of the TV series, the ID that identifies the show is named “tconst”, while in the dataset of episodes “tconst” holds the ID of the episode and the ID of the TV series is called “parentTconst”. We identified these inconsistencies while exploring the datasets.

  • Transcript Dataset:

    • The main obstacle of the dialogue dataset is that not all the HTML files share the same format. We overcame this difficulty by adding handling in our scraping code for each of the special cases that we detected.

    • The second difficulty that we experienced with the dialogue dataset was the cleaning of the dialogues. We tried to standardize the content of the dialogues as much as possible by identifying different names for the same character, common typos, and recurring text patterns that could hinder our analysis.

    You can follow the scraping code that led to the following data frame in the “EVAD_friends.Rmd” file in the GitHub repository.

library(robotstxt)  # provides paths_allowed()
url <- "https://fangj.github.io/friends/"
paths_allowed(url)  # check that robots.txt permits scraping this path
## # A tibble: 6 x 5
##   episode_id line_num scene character line                                 
##   <chr>         <dbl> <dbl> <chr>     <chr>                                
## 1 1 : 01            1     1 MONICA    There's nothing to tell! He's just s…
## 2 1 : 01            2     1 JOEY      C'mon, you're going out with the guy…
## 3 1 : 01            3     1 CHANDLER  All right Joey, be nice. So does he …
## 4 1 : 01            4     1 PHOEBE    Wait, does he eat chalk?             
## 5 1 : 01            5     1 PHOEBE    Just, 'cause, I don't want her to go…
## 6 1 : 01            6     1 MONICA    Okay, everybody relax. This is not e…
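The cleaning described above can be illustrated with a minimal base R sketch. It is not the code from “EVAD_friends.Rmd”: the raw lines, the regular expressions, and the names `raw`, `speaker`, `text` and `parsed` are assumptions made for illustration, and the real HTML files need additional special cases.

```r
# Hypothetical raw transcript lines; the real HTML files vary in format.
raw <- c(
  "Monica: There's nothing to tell! He's just some guy I work with!",
  "Joey: (laughing) C'mon, you're going out with the guy!",
  "[Scene: Central Perk, everyone is there.]",
  "Paolo: (something in Italian)"
)

# Keep only lines that start with a speaker name followed by a colon.
is_dialogue <- grepl("^[A-Za-z][A-Za-z .']*:", raw)
spoken <- raw[is_dialogue]

# Speaker: everything before the first colon, upper-cased for consistency.
speaker <- toupper(trimws(sub(":.*$", "", spoken)))

# Spoken text: everything after the first colon, minus parenthesised stage directions.
text <- sub("^[^:]*:", "", spoken)
text <- trimws(gsub("\\([^)]*\\)", "", text))
text[text == ""] <- NA  # a line made only of stage directions ends up empty

parsed <- data.frame(character = speaker, line = text, stringsAsFactors = FALSE)
print(parsed)
```

In this sketch the scene description is dropped because it does not start with a speaker name, and a line consisting only of a stage direction is kept with a missing value.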

III. Data transformation

i) Building a Metabase

  • Step 1:
    • Step 1.1: After scraping the dialogue data we had to clean and transform it to create a data frame with the name of the character, dialogue line, scene, episode and word count.
    First we added a word count to the dialogues.
## # A tibble: 6 x 2
##   line                                                                words
##   <chr>                                                               <int>
## 1 There's nothing to tell! He's just some guy I work with!               11
## 2 C'mon, you're going out with the guy! There's gotta be something w…    14
## 3 All right Joey, be nice. So does he have a hump? A hump and a hair…    16
## 4 Wait, does he eat chalk?                                                5
## 5 Just, 'cause, I don't want her to go through what I went through w…    16
## 6 Okay, everybody relax. This is not even a date. It's just two peop…    21
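One plausible way to obtain such a word count is to split each line on whitespace. The sketch below uses base R only; the helper name `count_words` is hypothetical, and the original code may count words differently (for example with a text-mining package).

```r
# Count whitespace-separated tokens; empty or missing lines count as zero words.
count_words <- function(x) {
  x <- trimws(x)
  ifelse(is.na(x) | !nzchar(x), 0L, lengths(strsplit(x, "\\s+")))
}

lines <- c(
  "There's nothing to tell! He's just some guy I work with!",
  "Wait, does he eat chalk?",
  ""
)
words <- count_words(lines)
print(data.frame(line = lines, words = words))
```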

We can see that some episodes were put together in the same file:

## [1] "2 : 12-13"  "6 : 15-16"  "9 : 23-24"  "10 : 17-18"

We split those episodes into two different ones and for more clarity, added season and episode columns.
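As an illustration of the identifier handling only, the sketch below expands a combined identifier such as "2 : 12-13" into one row per episode and derives the season and episode columns. The helper `expand_episode_id` is hypothetical, and how the dialogue lines of a double episode are assigned to each half is not shown here.

```r
# Expand "season : first-last" into one row per episode, with integer columns.
expand_episode_id <- function(id) {
  parts <- strsplit(id, " : ", fixed = TRUE)[[1]]
  season <- as.integer(parts[1])
  bounds <- as.integer(strsplit(parts[2], "-", fixed = TRUE)[[1]])
  episodes <- seq(bounds[1], bounds[length(bounds)])
  data.frame(
    episode_id = sprintf("%d : %02d", season, episodes),
    season = season,
    episode = episodes,
    stringsAsFactors = FALSE
  )
}

ids <- c("1 : 01", "2 : 12-13", "10 : 17-18")
expanded <- do.call(rbind, lapply(ids, expand_episode_id))
print(expanded)
```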

Then we had to correct some character names that had typos and we removed some lines that the scraping code caught that are not dialogues.

## # A tibble: 6 x 8
##   episode_id line_num scene character line             words season episode
##   <chr>         <dbl> <dbl> <chr>     <chr>            <int>  <int>   <int>
## 1 1 : 01            1     1 MONICA    There's nothing…    11      1       1
## 2 1 : 01            2     1 JOEY      C'mon, you're g…    14      1       1
## 3 1 : 01            3     1 CHANDLER  All right Joey,…    16      1       1
## 4 1 : 01            4     1 PHOEBE    Wait, does he e…     5      1       1
## 5 1 : 01            5     1 PHOEBE    Just, 'cause, I…    16      1       1
## 6 1 : 01            6     1 MONICA    Okay, everybody…    21      1       1

Further data transformations are described in each of the Results subsections.

Step 1.2: We extracted and decompressed the IMDb files and saved them as data frames. The 3 tables from IMDb that we used in our analysis are those that allowed us to extract the information related to the rating of each episode.

title.ratings.tsv.gz

## 'data.frame':    990485 obs. of  3 variables:
##  $ tconst       : Factor w/ 990485 levels "tt0000001","tt0000002",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ averageRating: num  5.6 6.1 6.5 6.2 6.1 5.2 5.5 5.4 5.4 6.9 ...
##  $ numVotes     : int  1547 187 1204 114 1932 102 615 1663 81 5539 ...

title.episode.tsv.gz

## 'data.frame':    4425501 obs. of  4 variables:
##  $ tconst       : Factor w/ 4425501 levels "tt0041951","tt0042816",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ parentTconst : Factor w/ 128694 levels "tt0038276","tt0039122",..: 59 34810 34810 22 34810 34810 34810 34274 34810 34810 ...
##  $ seasonNumber : Factor w/ 244 levels "\\N","1","10",..: 2 2 1 131 103 103 131 2 131 179 ...
##  $ episodeNumber: Factor w/ 15556 levels "\\N","0","1",..: 14445 6320 1 9103 6209 13326 7769 11103 9547 1115 ...
  • Step 2: Join the 3 data frames of IMDb to create an intermediate data frame. Notice that this data frame contains the average IMDb rating per episode. Furthermore, we created a suitable key to join this data frame with the data frame that contains the dialogues.
## 'data.frame':    236 obs. of  9 variables:
##  $ parentTconst : chr  "tt0108778" "tt0108778" "tt0108778" "tt0108778" ...
##  $ titleType    : Factor w/ 1 level "tvSeries": 1 1 1 1 1 1 1 1 1 1 ...
##  $ primaryTitle : Factor w/ 1 level "Friends": 1 1 1 1 1 1 1 1 1 1 ...
##  $ tconst       : chr  "tt0583431" "tt0583432" "tt0583433" "tt0583434" ...
##  $ seasonNumber : Factor w/ 244 levels "\\N","1","10",..: 212 3 3 3 223 3 190 201 103 190 ...
##  $ episodeNumber: Factor w/ 15556 levels "\\N","0","1",..: 13326 14445 6320 6431 3 3 3 3 2226 7769 ...
##  $ averageRating: num  8.2 8.6 9.5 9.7 8.7 8.5 8.9 8.7 8.6 8.8 ...
##  $ numVotes     : int  2568 2641 5829 9699 2783 2889 3376 2962 3472 3100 ...
##  $ episode_id   : chr  "7 : 08" "10 : 09" "10 : 17" "10 : 18" ...
  • Step 3: With a left join we created the Metabase, using the dialogues as the primary data frame and the IMDb ratings as the secondary data frame.
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 61264 obs. of  12 variables:
##  $ episode_id   : chr  "1 : 01" "1 : 01" "1 : 01" "1 : 01" ...
##  $ line_num     : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ scene        : num  1 1 1 1 1 1 1 2 2 2 ...
##  $ character    : chr  "MONICA" "JOEY" "CHANDLER" "PHOEBE" ...
##  $ line         : chr  "There's nothing to tell! He's just some guy I work with!" "C'mon, you're going out with the guy! There's gotta be something wrong with him!" "All right Joey, be nice. So does he have a hump? A hump and a hairpiece?" "Wait, does he eat chalk?" ...
##  $ words        : int  11 14 16 5 16 21 6 22 5 11 ...
##  $ season       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ episode      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ parentTconst : chr  "tt0108778" "tt0108778" "tt0108778" "tt0108778" ...
##  $ tconst       : chr  "tt0583459" "tt0583459" "tt0583459" "tt0583459" ...
##  $ averageRating: num  8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 ...
##  $ numVotes     : int  6098 6098 6098 6098 6098 6098 6098 6098 6098 6098 ...
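Steps 2 and 3 can be sketched with base R `merge()`. This is an illustration under stated assumptions rather than the original code: the toy tables hold only a few values taken from the output above, the object names (`imdb_episodes`, `imdb_ratings`, `metabase`) are hypothetical, and the key format "season : episode" is inferred from the `episode_id` column.

```r
# Toy versions of the IMDb episode and rating tables.
imdb_episodes <- data.frame(
  tconst = c("tt0583459", "tt0583431"),
  parentTconst = "tt0108778",
  seasonNumber = c("1", "7"),
  episodeNumber = c("1", "8"),
  stringsAsFactors = FALSE
)
imdb_ratings <- data.frame(
  tconst = c("tt0583459", "tt0583431"),
  averageRating = c(8.4, 8.2),
  numVotes = c(6098L, 2568L),
  stringsAsFactors = FALSE
)

# Step 2: join the IMDb tables and build a key matching the dialogue data.
ratings <- merge(imdb_episodes, imdb_ratings, by = "tconst")
ratings$episode_id <- sprintf("%d : %02d",
                              as.integer(ratings$seasonNumber),
                              as.integer(ratings$episodeNumber))

# Step 3: left join, keeping every dialogue line even without a rating.
dialogues <- data.frame(
  episode_id = c("1 : 01", "1 : 01", "7 : 08", "1 : 02"),
  character = c("MONICA", "JOEY", "ROSS", "RACHEL"),
  stringsAsFactors = FALSE
)
metabase <- merge(
  dialogues,
  ratings[, c("episode_id", "parentTconst", "tconst", "averageRating", "numVotes")],
  by = "episode_id", all.x = TRUE
)
print(metabase)
```

Note that `merge()` sorts the result by the key, so the original line order should be restored afterwards (for example by ordering on season, episode and `line_num`) if it matters.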

IV. Missing values

To search for missing values, we look at the number of missing values per column in the dialogues data frame.

##    episode_id      line_num         scene     character          line 
##             0             0             0             0            61 
##         words        season       episode  parentTconst        tconst 
##             0             0             0             0             0 
## averageRating      numVotes 
##             0             0

There appear to be some missing values in the line column. We will use visna from the extracat library to see the pattern of missing values.

These missing values are due to differing formats in the GitHub page used for web scraping. For example, lines like “Paolo: (something in Italian)” yield an NA line because the scraping code removes everything between parentheses.

We decided to fill those missing values with “”. By doing so we keep a record of those characters’ dialogue lines.
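A minimal sketch of this check and fix, using a toy data frame (the values and the name `missing_per_column` are illustrative):

```r
# Toy dialogue table with one missing line.
dialogues <- data.frame(
  character = c("MONICA", "PAOLO", "JOEY"),
  line = c("There's nothing to tell!", NA, "C'mon!"),
  words = c(4L, 0L, 1L),
  stringsAsFactors = FALSE
)

# Number of missing values per column.
missing_per_column <- colSums(is.na(dialogues))
print(missing_per_column)

# Replace missing lines with an empty string so the row is kept.
dialogues$line[is.na(dialogues$line)] <- ""
```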

V. Results

Labeling the main and secondary characters and its participation

  • The first big question that we want to explore is how to categorize the characters by importance, based on their participation. As previously mentioned, this question may seem naive; however, if we assume no previous knowledge of the sitcom and consider that more than 800 characters appeared over the ten seasons, the analysis is far from a naive exercise.

To answer this question we used the unsupervised machine learning technique of k-means. Its objective is to label the data based on certain characteristics; in this case, we used the number of words, lines, and scenes. To accomplish this task we used the libraries cluster and base. Moreover, we established a priori the number of labels that we wanted for our data: for practical reasons we set the number of groups (k) to three.

From the k-means analysis we obtain the following separation of characters:

  • Main Characters: As expected, Rachel, Monica, Phoebe, Joey, Chandler and Ross constitute one group that on average has 1,680 scenes, 8,469 lines and 87,498 words.

  • Secondary Characters: This group is composed of 33 characters, most of them recurring characters and guest stars. The average character in this group has 35 scenes, 133 lines and 1,228 words.

  • Other Characters: This group is composed of those characters that are incidental or did not have relevant importance in the sitcom. The average character in this group has 2 scenes, 7 lines and 64 words.

Centers:

##   Total_scene Total_lines Total_words vcluster
## 1 1679.833333  8469.33333 87783.33333        3
## 2   36.343750   135.81250  1251.81250        1
## 3    2.487365     7.34296    65.11432        2
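The clustering step can be sketched with `kmeans()` from the stats package that ships with R. The data below are synthetic, and the `log1p` transform is our assumption to keep the toy groups well separated; whether the original analysis scaled or transformed the features is not stated.

```r
set.seed(42)  # kmeans() uses random starting centers

# Synthetic totals per character: 6 "main", 10 "secondary", 30 "other".
scenes <- c(1650, 1700, 1720, 1640, 1690, 1680, seq(20, 56, by = 4), rep(1:3, 10))
lines  <- c(9300, 8400, 8500, 7500, 8600, 8500, seq(20, 56, by = 4) * 4, rep(1:3, 10) * 3)
words  <- lines * 10

totals <- data.frame(Total_scene = scenes, Total_lines = lines, Total_words = words)

# Assumption: cluster on log-transformed counts so no single feature dominates.
features <- log1p(as.matrix(totals))
fit <- kmeans(features, centers = 3, nstart = 25)
groups <- fit$cluster

# Cluster centers expressed in the original units (mean of the raw totals).
centers <- aggregate(totals, by = list(cluster = groups), FUN = mean)
print(centers)
```

The cluster numbers returned by `kmeans()` are arbitrary labels, so the groups have to be interpreted from their centers.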
  • Main characters participation

Friends is a TV show that tells the story of a group of six friends: Monica, Rachel, Phoebe, Chandler, Ross and Joey. Is one of these characters more important than others? We try to answer this question by looking at the number of lines for each of these main characters.

We can see that Rachel is the character with the most lines and Phoebe is the character with the fewest. Now we focus on the number of words instead of the number of lines.

Rachel and Ross are again the characters who speak the most, and Phoebe is the one with the fewest words. We can see that Monica ranks third by number of lines but fifth by number of words. This suggests that Monica’s lines tend to be shorter. The opposite happens with Joey: he ranks fifth by number of lines but third by number of words, which suggests his lines tend to be longer.
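The comparison between lines and words boils down to three per-character aggregates. A minimal sketch with toy numbers (not the real counts):

```r
# Toy dialogue table: one row per spoken line with its word count.
dialogues <- data.frame(
  character = c("MONICA", "MONICA", "MONICA", "JOEY", "JOEY"),
  words = c(3, 4, 5, 10, 14),
  stringsAsFactors = FALSE
)

lines_per_char <- tapply(dialogues$words, dialogues$character, length)
words_per_char <- tapply(dialogues$words, dialogues$character, sum)

# Average line length: a character can rank high by lines yet low by words.
words_per_line <- words_per_char / lines_per_char
print(cbind(lines = lines_per_char, words = words_per_char, words_per_line))
```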

By looking into the distribution of lines per episode we find the following:

  • Monica’s distribution looks narrower than the others. This indicates that there are few episodes in which Monica speaks a lot.

  • Chandler and Ross have long right tails; we infer that those characters have episodes in which they speak a lot.

  • Rachel and Ross have wider distributions.

Unraveling the character interactions

For the network analysis, a special data structure is required. We define an interaction between two characters as sharing the same scene. We must mention that the original structure of the dialogue data does not let us identify exactly who interacts with whom within a scene. Hence, we have assumed that all the characters that appear in a scene interact with one another. We represent the interactions with an adjacency matrix that records the number of interactions each character has with every other character.

With the library igraph we were able to create the adjacency matrix of the characters and quantify the interactions among the 869 characters. For the 6 main characters the adjacency matrix looks like this:

##          CHANDLER JOEY RACHEL ROSS MONICA PHOEBE
## CHANDLER        0  991    697  790   1051    729
## JOEY          991    0    747  764    758    746
## RACHEL        697  747      0  949    847    827
## ROSS          790  764    949    0    733    689
## MONICA       1051  758    847  733      0    875
## PHOEBE        729  746    827  689    875      0
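Under the shared-scene definition, an adjacency matrix like this one can be built from a scene-by-character incidence matrix. The sketch below uses base R rather than igraph, and toy data; identifying a scene by the pair (`episode_id`, `scene`) is our assumption.

```r
# Toy dialogue table: scene numbers restart in every episode.
dialogues <- data.frame(
  episode_id = c("1 : 01", "1 : 01", "1 : 01", "1 : 01", "1 : 01", "1 : 01", "1 : 02", "1 : 02"),
  scene      = c(1, 1, 1, 1, 2, 2, 1, 1),
  character  = c("MONICA", "JOEY", "CHANDLER", "MONICA", "MONICA", "JOEY", "JOEY", "CHANDLER"),
  stringsAsFactors = FALSE
)

# One row per scene, one column per character; 1 if the character speaks in the scene.
scene_key <- paste(dialogues$episode_id, dialogues$scene, sep = " / ")
counts <- unclass(table(scene_key, dialogues$character))
incidence <- (counts > 0) * 1

# Entry [i, j] = number of scenes shared by characters i and j.
adjacency <- crossprod(incidence)
diag(adjacency) <- 0  # a character does not interact with itself
print(adjacency)
```

The resulting matrix can then be passed to a graph library; for example, igraph provides `graph_from_adjacency_matrix()` for weighted, undirected graphs.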

Also, we can interactively visualize the relationships between the main and the secondary characters. The width of each edge in the graph represents the level of interaction between the characters; as you may perceive, the interactions among the main characters are very balanced. Click and drag the vertices, or select the groups or the characters.

Topic modelling using LDA

For topic modelling we will use the package textmineR. We will try to find the topic for each episode. To do so we will create a document for each episode, so we have to group lines by episode_id.

## # A tibble: 6 x 2
##   episode_id lines                                                         
##   <chr>      <chr>                                                         
## 1 1 : 01     There's nothing to tell! He's just some guy I work with! C'mo…
## 2 1 : 02     What you guys don't understand is, for us, kissing is as impo…
## 3 1 : 03     Hi guys! Hey, Pheebs! Hi! Hey. Oh, oh, how'd it go? Um, not s…
## 4 1 : 04     "Alright. Phoebe? Okay, okay. If I were omnipotent for a day,…
## 5 1 : 05     "Would you let it go? It's not that big a deal. Not that big …
## 6 1 : 06     Ooh! Look! Look! Look! Look, there's Joey's picture! This is …

The function CreateDtm creates a document term matrix. To do so we supply a set of stopwords: words we do not want to use because they are very frequent in the English language and do not give insightful information.

We will use the document term matrix to create a term and document frequency table that counts the number of times a term appears (term frequency) and the number of documents in which a term appears (document frequency).

These are the main terms ordered by term frequency:

##        term term_freq doc_freq
## 11509  good      1714      231
## 11508   god      1677      228
## 11507  guys      1468      225
## 11506 great      1342      225
## 11505  time      1215      229
## 11504  back      1125      223
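To make the two frequencies concrete, here is a small base R sketch that builds a document term matrix by hand. It does not use textmineR, and the documents and stopword list are toy assumptions.

```r
# Toy documents, one per episode, and a tiny illustrative stopword list.
docs <- c(
  "1 : 01" = "Good god, that is good!",
  "1 : 02" = "You guys are good.",
  "1 : 03" = "Oh my god, what time is it?"
)
stopwords <- c("that", "is", "you", "are", "oh", "my", "what", "it")

# Tokenize, drop empty tokens and stopwords.
tokens <- strsplit(tolower(docs), "[^a-z']+")
tokens <- lapply(tokens, function(tk) tk[nzchar(tk) & !(tk %in% stopwords)])

# Document term matrix: rows are documents, columns are terms, cells are counts.
dtm <- unclass(table(
  doc  = rep(names(docs), lengths(tokens)),
  term = unlist(tokens, use.names = FALSE)
))

tf <- data.frame(
  term = colnames(dtm),
  term_freq = colSums(dtm),       # total occurrences of the term
  doc_freq  = colSums(dtm > 0),   # number of documents containing the term
  stringsAsFactors = FALSE
)
tf <- tf[order(-tf$term_freq), ]
print(tf)
```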

Now we fit a Latent Dirichlet allocation model in which we will try to fit 15 topics to the collection of episodes. This will return two main matrices:

  • theta: Matrix with the probability of topic per document -> P(topic | document).
  • phi: Matrix with the probability of term per topic -> P(term | topic).
## [1] "Theta:"
##                t_1          t_2          t_3         t_4        t_5
## 1 : 01 0.003959440 0.2482858522 0.0348623853 0.069628199 0.08025109
## 1 : 02 0.017833456 0.0281503316 0.0325718497 0.237435520 0.03109801
## 1 : 03 0.011225296 0.0001581028 0.1693280632 0.003320158 0.02387352
## 1 : 04 0.004108681 0.0014579192 0.0557985421 0.139297548 0.09158383
## 1 : 05 0.007375271 0.0001446132 0.0001446132 0.111496746 0.11005061
## 1 : 06 0.028275352 0.0031088083 0.0549222798 0.001628423 0.39393042
## [1] "Phi:"
##          met_guy       gellar meeting_meeting cameras_smell    potpourri
## t_1 8.583028e-06 8.583028e-06    8.583028e-06  8.583028e-06 8.583028e-06
## t_2 8.660333e-06 8.660333e-06    8.660333e-06  1.818670e-04 8.660333e-06
## t_3 5.786066e-06 5.786066e-06    5.786066e-06  5.786066e-06 5.786066e-06
## t_4 9.490457e-06 9.490457e-06    9.490457e-06  9.490457e-06 9.490457e-06
## t_5 6.904695e-06 6.904695e-06    6.904695e-06  6.904695e-06 6.904695e-06
## t_6 9.590578e-06 9.590578e-06    9.590578e-06  9.590578e-06 9.590578e-06

Now the 15 topics have been created. To assess the quality of the topics we look at the topic coherence, a measure of how strongly associated the words in a topic are.

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.002827  0.003716  0.019262  0.035778  0.056138  0.115704

We will use phi to get the top 5 terms per topic.

##      [,1]      [,2]           [,3]       [,4]        [,5]     
## t_1  "sister"  "meet"         "give"     "part"      "honey"  
## t_2  "wedding" "married"      "guys"     "love"      "parents"
## t_3  "guys"    "game"         "money"    "apartment" "move"   
## t_4  "janice"  "carol"        "guy"      "woman"     "susan"  
## t_5  "guys"    "big"          "bye"      "julie"     "listen" 
## t_6  "emma"    "mike"         "guys"     "baby"      "love"   
## t_7  "monkey"  "marcel"       "people"   "joke"      "drake"  
## t_8  "dad"     "birthday"     "mom"      "guys"      "party"  
## t_9  "baby"    "ring"         "pregnant" "guys"      "god"    
## t_10 "job"     "guy"          "guys"     "great"     "good"   
## t_11 "cat"     "thing"        "mark"     "god"       "love"   
## t_12 "guys"    "love"         "good"     "plane"     "bob"    
## t_13 "guys"    "thanksgiving" "year"     "dog"       "school" 
## t_14 "good"    "god"          "time"     "great"     "wait"   
## t_15 "emily"   "married"      "love"     "london"    "pheebs"
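Extracting the top terms amounts to sorting each row of phi in decreasing order. A minimal sketch with a toy phi matrix (the probabilities are made up, and only the top 2 terms are kept here):

```r
# Toy phi: rows are topics, columns are terms, each row sums to 1.
phi <- matrix(
  c(0.5, 0.3, 0.1, 0.1,
    0.1, 0.1, 0.6, 0.2),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("t_1", "t_2"), c("wedding", "married", "monkey", "marcel"))
)

# For each topic, keep the names of the most probable terms.
n_top <- 2
top_terms <- t(apply(phi, 1, function(p) names(sort(p, decreasing = TRUE))[seq_len(n_top)]))
print(top_terms)
```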

The next step is to compute the topic prevalence using theta. Topic prevalence indicates the most frequent topics in the TV show.
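A common way to compute prevalence, and the one we assume in this sketch, is the share of total topic probability mass assigned to each topic, expressed as a percentage. The theta matrix is a toy example.

```r
# Toy theta: rows are episodes, columns are topics, each row sums to 1.
theta <- matrix(
  c(0.9, 0.1,
    0.6, 0.4,
    0.3, 0.7),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1 : 01", "1 : 02", "1 : 03"), c("t_1", "t_2"))
)

# Percentage of the total probability mass that falls on each topic.
prevalence <- colSums(theta) / sum(theta) * 100
print(round(prevalence, 3))
```

With this definition the prevalences of all topics add up to 100.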

Finally, we get a summary for the complete LDA model.

##      topic coherence prevalence                             top_terms
## t_14  t_14     0.000     40.628          good, god, time, great, wait
## t_10  t_10     0.003      6.197           job, guy, guys, great, good
## t_3    t_3     0.009      5.902    guys, game, money, apartment, move
## t_5    t_5    -0.003      5.432         guys, big, bye, julie, listen
## t_8    t_8     0.064      4.690       dad, birthday, mom, guys, party
## t_9    t_9     0.019      4.507       baby, ring, pregnant, guys, god
## t_1    t_1     0.004      4.174       sister, meet, give, part, honey
## t_4    t_4     0.061      4.097      janice, carol, guy, woman, susan
## t_2    t_2     0.050      3.800 wedding, married, guys, love, parents
## t_15  t_15     0.116      3.640  emily, married, love, london, pheebs
## t_7    t_7     0.052      3.601   monkey, marcel, people, joke, drake
## t_11  t_11     0.006      3.494           cat, thing, mark, god, love
## t_13  t_13     0.051      3.460 guys, thanksgiving, year, dog, school
## t_6    t_6     0.108      3.349          emma, mike, guys, baby, love
## t_12  t_12    -0.002      3.028          guys, love, good, plane, bob

We can see that the most prevalent (frequent) topic has words like “good”, “god”, “great” and “time”. This makes sense: these words are very frequent in the TV show, so they give very little information about the topic, which is why the coherence is 0.0.

The other topics in the model have less prevalence but they are more coherent. If you are a fan of the show and if you read the list of top terms, we are sure you can remember episodes in which those terms were important.

To find those important episodes we created a d3 tool. Using theta, we wrote a CSV file that contains, for each episode and topic, the probability of that topic given the episode and the top terms of that topic.

##       id topic        value topic_num
## 1 1 : 01   t_1 0.0039594399         1
## 2 1 : 01  t_14 0.3690004829        14
## 3 1 : 01   t_6 0.0000965717         6
## 4 1 : 01  t_15 0.0020280058        15
## 5 1 : 01  t_13 0.0242394978        13
## 6 1 : 01   t_7 0.0551424433         7
##                               top_terms                   name
## 1       sister, meet, give, part, honey Monica Gets A Roommate
## 2          good, god, time, great, wait Monica Gets A Roommate
## 3          emma, mike, guys, baby, love Monica Gets A Roommate
## 4  emily, married, love, london, pheebs Monica Gets A Roommate
## 5 guys, thanksgiving, year, dog, school Monica Gets A Roommate
## 6   monkey, marcel, people, joke, drake Monica Gets A Roommate
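Reshaping theta into this long layout can be sketched in base R as follows. The theta values, the top-term strings and the output file name are placeholders, and the episode name column is omitted.

```r
# Toy theta: rows are episodes, columns are topics.
theta <- matrix(
  c(0.9, 0.6, 0.3, 0.1, 0.4, 0.7),
  nrow = 3,
  dimnames = list(c("1 : 01", "1 : 02", "1 : 03"), c("t_1", "t_2"))
)
top_terms <- c(t_1 = "wedding, married", t_2 = "monkey, marcel")

# One row per (episode, topic) pair.
long <- as.data.frame(as.table(theta), stringsAsFactors = FALSE)
names(long) <- c("id", "topic", "value")
long$topic_num <- as.integer(sub("^t_", "", long$topic))
long$top_terms <- unname(top_terms[long$topic])

# Hypothetical output location for the file read by the d3 tool.
write.csv(long, file.path(tempdir(), "topics.csv"), row.names = FALSE)
print(head(long))
```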

Interactive d3 topic modeling Friends episodes topics

Top episodes by topic

Click the circles to see the episodes with highest P(topic | episode).

Features driving the rating

To look for the main features that drive the rating up or down, it is useful to start by observing its temporal structure. The following graph is interactive:

Click on the season title to (de)activate each series

  • There is a generalized behavior among the 10 seasons:
    • Episode 1 has a high rating, which immediately falls into a valley around 8.4
    • Through episodes 6-9 there is usually a peak followed by another (larger) valley
    • Finally, the last episode has a high rating

Next, to create a cleaner view of the data, we create a boxplot.

The boxplots not only confirm, in a cleaner graph, the behavior observed previously, but also provide additional insights:

  • The perceived quality (in terms of rating) of episodes 7 to 9 varies a lot from season to season.
  • The 10th episode of every season has a low rating, around 8.2. These episodes traditionally aired during the holiday season in December.
  • Two highest ratings:
    • Special episode after the 1996 Super Bowl
    • Final episode (Season 10, episode 18)
  • Two lowest ratings:
    • Season 10, episode 10: “Christmas in Tulsa”
    • Season 6, episode 20: “Mac and Cheese” happens in the 19-21 episode valley

A likely explanation for the behavior of the last 7 episodes of all seasons is that the writers may have prepared an intricate (and not necessarily amusing) plot that reaches its climax in the last episodes.

An interesting feature to explore is how the number of viewers and voters relate to the average rating.

The scatterplot matrix above helps to see that the better the episode is, the likelier it is that people will invest their time to rate it. Specifically, this relationship appears to be exponential. In contrast, the number of viewers shows a small positive linear correlation.

This TV show is commonly perceived as having had constantly increasing ratings and numbers of followers. However, we can show that this is not completely true, at least while it aired on TV, before streaming services like Netflix became the big players they are today.

The following boxplot graph shows that the hype for the show and the main actors’ salaries were not backed by the number of people following the weekly episodes.

Clearly, the most successful season was season 2, when the actors were paid in the range from $20,000 to $40,000 per episode: it had approximately 50% more viewers than season 10, when they were paid $1 million per episode.

Continuing with the number of viewers: when the last episode of a season is considered a good one, we expect people to watch the beginning of the next season and then lose interest (until the last episodes of that season).

An additional insight: episodes 19-22 usually aired around ____________, which explains the low number of viewers.

Probably the most ubiquitous discussion among Friends fans is which of the main characters is the best. Here we define the best character as the one that drives the rating most strongly. To do so, we look for a relation between the number of lines of each character and the rating of each episode, as well as at the interactions among all of the characters.

It is no surprise that Monica and Chandler are the characters with the most interactions between them, since they were a couple for a longer time than Ross and Rachel (second place).

As to the rating, we can see that the individual participation of each character does not show a high correlation with the rating of an episode. However, we should notice that all the correlations are positive and that Ross is the character with the highest one, which aligns with the previous analysis of interactions.

To test how statistically significant these results are, we ran correlation hypothesis tests. Considering that the relation is not necessarily linear and that there are many ties in the rating (due to rounding), we decided to run both a parametric (Pearson) and a non-parametric (Spearman) hypothesis test for \(H_0: \rho \leq 0\) vs \(H_1:\rho>0\), which resulted in the following p-values:

p-values
            Spearman    Pearson
CHANDLER   0.0218936  0.1678547
JOEY       0.0429544  0.0969200
MONICA     0.0220195  0.0029182
PHOEBE     0.5644974  0.0411303
RACHEL     0.0608283  0.0328136
ROSS       0.0005194  0.0000345

Ross seems to be the most relevant character, given the number of lines and scenes in which he participates, the fact that he has the highest correlation with the rating, and the fact that this correlation is significantly greater than zero under both hypothesis tests.
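One-sided tests of this kind can be run with `cor.test()` from the stats package. The sketch below uses made-up numbers for a single character; `exact = FALSE` is chosen because tied ratings rule out an exact Spearman p-value.

```r
# Toy data: lines spoken by one character and the rating of each episode.
lines_spoken <- c(20, 35, 28, 50, 42, 60, 55, 70)
rating       <- c(8.1, 8.3, 8.2, 8.6, 8.4, 8.8, 8.6, 9.0)

# H0: rho <= 0 vs H1: rho > 0, parametric and rank-based versions.
pearson  <- cor.test(lines_spoken, rating, method = "pearson",  alternative = "greater")
spearman <- cor.test(lines_spoken, rating, method = "spearman", alternative = "greater",
                     exact = FALSE)

p_values <- c(Spearman = spearman$p.value, Pearson = pearson$p.value)
print(p_values)
```

Repeating this for each of the six main characters gives one row of p-values per character.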

VI. Interactive component

We included four elements that are interactive:

VII. Conclusion

This project turned out to be a complex challenge from the point of view of data management, since we had to move from semi-structured data such as HTML files to structured databases that could be used for the purposes of the project.

Additionally, we believe that asking complex questions about a topic as popular as Friends was a challenge. That is why we decided to use unsupervised machine learning techniques, which allowed us to perform deep and objective analyses under the premise of complete ignorance of a series of which we were already fans.

Finally, we believe that the data to which we have access has the potential to support more complete and complex analyses. For example, we would have liked to have had more time and resources to carry out a deeper analysis of the temporal interactions of the characters from a graph theory perspective.